Micro-benchmarking the GT200 GPU

Authors

  • Misel-Myrto Papadopoulou
  • Maryam Sadooghi-Alvandi
  • Henry Wong
Abstract

Graphics processors (GPUs) are interesting for non-graphics parallel computation because of the potential for more than an order of magnitude of speedup over CPUs. Because the GPU is often presented through a C-like abstraction such as Nvidia’s CUDA, little is known about the hardware architecture of the GPU beyond the high-level descriptions documented by the manufacturer. We develop a suite of micro-benchmarks to measure the CUDA-visible architectural characteristics of the Nvidia GT200 (GTX280) GPU. We measure properties of the arithmetic pipelines, the stack-based handling of branch divergence, and the warp-granularity operation of the barrier synchronization instruction. We confirm that global memory is uncached with ∼441 clock cycles of latency, and measure parameters of the three levels of instruction and constant caches and three levels of TLBs. We succeed in revealing more detail about the GT200 architecture than previously disclosed.
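The abstract does not say how the latency figures were obtained. One common way to measure memory latency is a dependent pointer chase, where each load's address comes from the previous load so that accesses cannot overlap. The sketch below illustrates only that access pattern on the host, in Python; the function names (`build_stride_chain`, `chase`, `average_ns_per_load`) are hypothetical, and this is an assumption about the general technique rather than the authors' actual benchmark code.

```python
import time


def build_stride_chain(num_elements, stride):
    """Return a list where entry i holds the index of the next element to visit.

    Hypothetical helper: models an array whose contents encode a fixed-stride walk.
    """
    return [(i + stride) % num_elements for i in range(num_elements)]


def chase(chain, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the final index."""
    j = start
    for _ in range(steps):
        j = chain[j]  # the next address depends on the value just loaded
    return j


def average_ns_per_load(chain, steps):
    """Time a chase and return the mean wall-clock nanoseconds per step.

    In Python this figure is dominated by interpreter overhead, so it only
    demonstrates the measurement structure, not real hardware latency.
    """
    begin = time.perf_counter_ns()
    chase(chain, steps)
    end = time.perf_counter_ns()
    return (end - begin) / steps
```

On a GPU the same structure would presumably run inside a single-thread kernel and be timed with the device cycle counter, with the array size and stride varied to expose cache and TLB parameters; those details are not given in the abstract and are stated here only as an assumption.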

Similar resources

Fully accelerating quantum Monte Carlo simulations of real materials on GPU clusters

Continuum quantum Monte Carlo (QMC) has proved to be an invaluable tool for predicting the properties of matter from fundamental principles. By solving the many-body Schrödinger equation through a stochastic projection, it achieves greater accuracy than mean-field methods and better scalability than quantum chemical methods, enabling scientific discovery across a broad spectrum of disciplines. T...

Measuring Performance, Estimating Most Productive Scale Size, and Benchmarking of Hospitals Using DEA Approach: A Case Study in Iran

Background and Objectives: The goal of current study is to evaluate the performance of hospitals and their departments. This manuscript aimed at estimation of the most productive scale size (MPSS), returns to scale (RTS), and benchmarking for inefficient hospitals and their departments. Methods: The radial and non-radial data envelopment analysis (DEA) ap...

A GPU Implementation of the Complex Logarithmic Number System

In this paper we present a technique to implement the Complex Logarithmic Number System (CLNS) on a Graphics Processing Unit (GPU). Although CLNS multiplication is a simple FP addition, CLNS addition involves evaluations of transcendental functions, which can be carried out in a few different ways by utilizing the GPU hardware resources, such as the special function units, the floating point un...

Measuring the Impact of Configuration Parameters in CUDA Through Benchmarking

The threadblock size and shape choice is one of the most important user decisions when a parallel problem is coded to run in GPU architectures. In fact, threadblock configuration has a significant impact on the global performance of the program. Unfortunately, the programmer has not enough information about the subtle interactions between this choice of parameters and the underlying hardware. T...

N-body Simulation on cluster of GPUs

We present the results of a hierarchical N-body simulation on DEGIMA, a cluster of PCs with 576 graphic processing units (GPUs) and using an InfiniBand interconnect. DEGIMA stands for DEstination for GPU Intensive MAchine, and is located at Nagasaki Advanced Computing Center (NACC), Nagasaki University. In this work, we have upgraded DEGIMA’s interconnect using InfiniBand. DEGIMA is composed b...

Publication date: 2009